This project guides you through a data science pipeline with a tutorial I have put together, covering data curation, parsing and management, exploratory data analysis, hypothesis testing, and machine learning.
The topic of this project is anime. Anime is a Japanese animated entertainment medium spanning a myriad of genres, each with its own fanbase. This makes it a great topic for data science. There are many websites where one can watch anime, such as http://www.crunchyroll.com/videos/anime or, for the more savvy or daring, a certain site like “😘anime” or something along the lines of “🐈(the sound this creature makes in Japanese).si”
Let’s get started!
Well here goes…I-It’s not that I like you or anything, I just happened to do this…
Anime data is fairly accessible: multiple sources track seasonal releases, and many users sign up on websites to rate and track their anime lists. There are many ways to access this data. You can parse the data yourself from websites, use an API provided by a website, or, in our case, download a prepared CSV (comma-separated values) file.
Although we are not curating our data this way, below is an example of how to use an API such as one from myanimelist. More info can be found here: https://myanimelist.net/modules.php?go=api.
library(httr)
# Spaces in the search term are percent-encoded as %20
darling <- GET(url = "https://myanimelist.net/api/anime/search.xml?q=Darling%20in%20the%20FranXX")
print(darling)
## Response [https://myanimelist.net/api/anime/search.xml?q=Darling%20in%20the%20FranXX]
## Date: 2018-05-15 04:01
## Status: 401
## Content-Type: text/html; charset=UTF-8
## Size: 19 B
# We get a 401 Unauthorized error because you need to verify your credentials with myanimelist before using their API (not doing that here!!!).
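Typing percent-encoding by hand is error-prone, so it can help to let R build the query string. Here is a minimal sketch using base R's URLencode; the title variable is just an illustrative name, and the endpoint is the one from the example above (it will still answer 401 without credentials).

```r
# Illustrative search term; any title with spaces or punctuation works the same way.
title <- "Darling in the FranXX"

# reserved = TRUE also escapes characters such as "/", "?" and "&".
encoded <- utils::URLencode(title, reserved = TRUE)

# Append the encoded term to the search endpoint used above.
search_url <- paste0("https://myanimelist.net/api/anime/search.xml?q=", encoded)
print(search_url)
```

The resulting string can be passed to GET(url = search_url) in place of a hand-written URL.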
Because we are lazy (smart 😏) we shall find some fine prepared data from our favorite dataset site, kaggle.com.
A throwaway account just for this project…
Ahh perfect, now that we have the account we can download datasets for free. I will be using this one: https://www.kaggle.com/CooperUnion/anime-recommendations-database/data
library(readr)
library(tidyr)
library(dplyr)
library(tidyverse)
anime <- read_csv("anime.csv")
anime %>%
  head(12)
## # A tibble: 12 x 7
## anime_id name genre type episodes rating members
## <int> <chr> <chr> <chr> <chr> <dbl> <int>
## 1 32281 Kimi no Na wa. Drama, Romance~ Movie 1 9.37 200630
## 2 5114 Fullmetal Alche~ Action, Advent~ TV 64 9.26 793665
## 3 28977 Gintama° Action, Comedy~ TV 51 9.25 114262
## 4 9253 Steins;Gate Sci-Fi, Thrill~ TV 24 9.17 673572
## 5 9969 Gintama' Action, Comedy~ TV 51 9.16 151266
## 6 32935 Haikyuu!!: Kara~ Comedy, Drama,~ TV 10 9.15 93351
## 7 11061 Hunter x Hunter~ Action, Advent~ TV 148 9.13 425855
## 8 820 Ginga Eiyuu Den~ Drama, Militar~ OVA 110 9.11 80679
## 9 15335 Gintama Movie: ~ Action, Comedy~ Movie 1 9.10 72534
## 10 15417 Gintama': ~ Action, Comedy~ TV 13 9.11 81109
## 11 4181 Clannad: After ~ Drama, Fantasy~ TV 24 9.06 456749
## 12 28851 Koe no Katachi Drama, School,~ Movie 1 9.05 102733
ratings <- read_csv("rating.csv")
ratings %>%
  head(12)
## # A tibble: 12 x 3
## user_id anime_id rating
## <int> <int> <int>
## 1 1 20 -1
## 2 1 24 -1
## 3 1 79 -1
## 4 1 226 -1
## 5 1 241 -1
## 6 1 355 -1
## 7 1 356 -1
## 8 1 442 -1
## 9 1 487 -1
## 10 1 846 -1
## 11 1 936 -1
## 12 1 1546 -1
The read_csv function is part of the readr package; read all about it here: https://cran.r-project.org/web/packages/readr/README.html. We have also loaded other libraries that are useful for managing and tidying data frames.
We now have two data sets, one for ratings and one for anime. Although this data is fairly clean, we need to tidy it further. For example, a rating of -1 marks an anime the user did not rate. Left in place, these values would skew later calculations, so we replace them with NA.
Sometimes data is a bit dirty… As a beginner data scientist, it is your job to make sure your data is clean!
Kaho from Blend S
# Replace the -1 placeholder with NA, touching only the rating column
ratings$rating[ratings$rating == -1] <- NA
head(ratings, 5)
## # A tibble: 5 x 3
## user_id anime_id rating
## <int> <int> <int>
## 1 1 20 NA
## 2 1 24 NA
## 3 1 79 NA
## 4 1 226 NA
## 5 1 241 NA
We can also tidy the data another way. Here we use the mutate function to change the rating field conditionally with the ifelse function. Its first argument is the condition, the second is the value returned where the condition is true, and the third is the value returned where it is false. This produces the same result as the previous method - the choice is yours.
ratings <- ratings %>%
  mutate(rating = ifelse(rating == -1, NA, rating))
ratings %>%
  head(5)
## # A tibble: 5 x 3
## user_id anime_id rating
## <int> <int> <int>
## 1 1 20 NA
## 2 1 24 NA
## 3 1 79 NA
## 4 1 226 NA
## 5 1 241 NA
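If ifelse still feels abstract, here is a tiny sketch on a made-up vector (toy_rating is not part of the Kaggle data) showing that the condition is checked element by element.

```r
# Made-up ratings; -1 stands for "not rated".
toy_rating <- c(-1L, 7L, 10L, -1L, 8L)

# ifelse(test, yes, no): where test is TRUE take yes, otherwise take no.
cleaned <- ifelse(toy_rating == -1L, NA, toy_rating)
print(cleaned)

# Count how many placeholders were replaced.
n_replaced <- sum(is.na(cleaned))
print(n_replaced)
```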
“The goal of EDA is to perform an initial exploration of attributes/variables across entities/observations.” Hector Corrada Bravo
Ok, let’s get to it! First, we have to import the libraries tibble and ggplot2 for working with dataframes and plotting them.
library(tibble)
library(ggplot2)
ratings %>%
  sample_frac(.01) %>%
  rowid_to_column() %>%
  ggplot(aes(x = rowid, y = rating)) +
  geom_point()
Now we have a simple scatterplot. Confused? Don’t worry, sensei’s got your back. Here are the main points:
Yeah, so much for a scatterplot. Ratings are whole numbers from 1 to 10, and there are so many of them that the dots have merged into what is basically a stack of horizontal lines.
However, notice that toward the bottom of the graph the plot shows a bit more “scatter”. This means that fewer people gave low scores.
This seems about right, I don’t really rate things low either. Perhaps the reason is that someone who would have rated an anime low simply dropped the show early on or did not rate it at all.
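When a scatterplot overplots like this, counting the ratings per score is one way to see the same pattern in numbers. A minimal sketch, assuming a small made-up vector (toy_ratings is illustrative, not the Kaggle data):

```r
# Made-up ratings on the 1 to 10 scale.
toy_ratings <- c(8L, 9L, 7L, 8L, 10L, 8L, 3L, 9L, 7L, 8L)

# Count how often each score occurs; factor() keeps scores with zero ratings.
counts <- table(factor(toy_ratings, levels = 1:10))
print(counts)

# Share of ratings that are 7 or higher.
high_share <- mean(toy_ratings >= 7)
print(high_share)
```

On the real data, the same idea would be applied to ratings$rating after removing the NA values.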
Let us picture the distribution a bit more clearly.
ratings %>%
  sample_frac(.01) %>%
  arrange(rating) %>%
  rowid_to_column() %>%
  ggplot(aes(x = rowid, y = rating)) +
  geom_point() +
  ggtitle("A better scatterplot")
The arrange function allows us to sort data, in this case by ratings.
Hmmm…seems in line with what we explained to ourselves above. There are very few low ratings and many more high ratings. Just take a look at how thicc the lines are for ratings around 7 to 9.
Before we do anything too advanced we should take a look at simple summary statistics. We can look at the plots all we want but let’s just get straight to the numbers. With the code below we can see the obvious max_rating of 10 and the min_rating of 1.
We first filter out the NA ratings: they carry no information, and functions such as min, max, and mean return NA if any input is NA (unless na.rm = TRUE is set).
ratings <- ratings %>%
  filter(!is.na(rating))
ratings %>%
  summarize(min_rating = min(rating),
            max_rating = max(rating),
            median_rating = median(rating),
            mean_rating = mean(rating))
## # A tibble: 1 x 4
## min_rating max_rating median_rating mean_rating
## <dbl> <dbl> <int> <dbl>
## 1 1.00 10.0 8 7.81
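To see why the NA values matter for these summaries, here is a small sketch on a made-up vector (toy is illustrative): a single NA is enough to turn the mean into NA unless it is removed first.

```r
# Made-up ratings with one missing value.
toy <- c(NA, 8, 9, 7)

# Without any NA handling the result is NA.
with_na <- mean(toy)
print(with_na)

# Option 1: ask mean() to drop NA values.
without_na <- mean(toy, na.rm = TRUE)
print(without_na)

# Option 2: drop them beforehand, similar in spirit to filter(!is.na(rating)).
filtered <- mean(toy[!is.na(toy)])
print(filtered)
```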
The mean rating is 7.8 and the median is 8. Interesting fact: My mean rating on myanimelist is actually 7.67 (seems like I rate similarly to the average anime watcher, perhaps a tiny bit more judgemental).
We have the summary statistics; now we shall visualize the median of this on a histogram in order to better picture the distribution of ratings. The median is the 50th percentile. Half of the data is below it, and half is above it. Let’s take a look at where this falls (median in vertical line):
ratings %>%
  filter(!is.na(rating)) %>%
  ggplot(aes(x = rating)) +
  geom_histogram(bins = 100) +
  ggtitle("Histogram with median") +
  geom_vline(aes(xintercept = median(rating)), color = "green")
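The "median is the 50th percentile" idea can be checked directly. A minimal sketch on made-up values (toy is illustrative, not the Kaggle data):

```r
# Made-up ratings; sorted they are 5, 6, 7, 8, 8, 9, 10.
toy <- c(6, 7, 8, 8, 9, 10, 5)

# The median and the 50th percentile are the same number.
med <- median(toy)
q50 <- quantile(toy, probs = 0.5, names = FALSE)
print(c(median = med, q50 = q50))

# Shares of values strictly below and strictly above the median.
below <- mean(toy < med)
above <- mean(toy > med)
print(c(below = below, above = above))
```

With tied values such as whole-number ratings, the shares on each side need not be exactly one half, because some values sit right on the median.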
Anime spans a wide range of genres, so we would expect ratings to be distributed differently from genre to genre.
If you want to see an example of what types of genres are out there and what they mean, check this link out: https://myanimelist.net/info.php?go=genre.
What better way to see this than with boxplots?
anime_genre <- anime %>%
  mutate(genre = map(genre, ~ strsplit(.x, ", ") %>% unlist())) %>%
  unnest(genre)
anime_genre
## # A tibble: 36,347 x 7
## anime_id name type episodes rating members genre
## <int> <chr> <chr> <chr> <dbl> <int> <chr>
## 1 32281 Kimi no Na wa. Movie 1 9.37 200630 Drama
## 2 32281 Kimi no Na wa. Movie 1 9.37 200630 Romance
## 3 32281 Kimi no Na wa. Movie 1 9.37 200630 School
## 4 32281 Kimi no Na wa. Movie 1 9.37 200630 Supernat~
## 5 5114 Fullmetal Alchemist: ~ TV 64 9.26 793665 Action
## 6 5114 Fullmetal Alchemist: ~ TV 64 9.26 793665 Adventure
## 7 5114 Fullmetal Alchemist: ~ TV 64 9.26 793665 Drama
## 8 5114 Fullmetal Alchemist: ~ TV 64 9.26 793665 Fantasy
## 9 5114 Fullmetal Alchemist: ~ TV 64 9.26 793665 Magic
## 10 5114 Fullmetal Alchemist: ~ TV 64 9.26 793665 Military
## # ... with 36,337 more rows
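To make the reshaping easier to follow, here is the same split done in base R on a made-up two-row data frame (toy, genres, and long are illustrative names). It is a sketch of the idea, not a replacement for the pipeline above.

```r
# Made-up data: one show with two genres, one show with a single genre.
toy <- data.frame(
  name = c("Show A", "Show B"),
  genre = c("Drama, Romance", "Action"),
  stringsAsFactors = FALSE
)

# Split each genre string on ", " into a list of character vectors.
genres <- strsplit(toy$genre, ", ")

# Repeat each show once per genre and flatten the list into one column.
long <- data.frame(
  name = rep(toy$name, lengths(genres)),
  genre = unlist(genres),
  stringsAsFactors = FALSE
)
print(long)
```

Each show now appears once per genre, which is the shape the boxplots need.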
anime_genre %>%
  ggplot(mapping = aes(x = genre, y = rating, fill = genre)) +
  geom_boxplot() +
  theme_gray(base_size = 7) +  # smaller base font so the genre labels fit
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle("Boxplot: Ratings across genres")
OK, let’s talk about this. First, we had to split the anime shows by genre in the anime dataset. Some anime have multiple genres; the first chunk of code breaks these up so that each genre becomes a separate row in a new data frame, anime_genre. We then use this new data frame as the dataset for a series of boxplots that show the distribution of anime ratings across genres. Here are some interesting ones to note:
Time to test a hypothesis… Let’s come up with something simple for a null hypothesis: The average anime rating is a 7 or greater.
There are many ways to test a hypothesis, one way being via critical values and another being via p-values. We will be using p-values.
Before doing any testing, we must set a significance level. We will use alpha = 0.05. This means we accept a 5% chance of rejecting the null hypothesis when it is actually true. If you want to learn more about p-values, feel free to check out this link for dummies below:
http://www.dummies.com/education/math/statistics/what-a-p-value-tells-you-about-statistical-data/
Time to learn.
Now we must calculate a p-value. We will do so using R; don’t worry, we shall explain after.
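sample_mean <- 7.8        # mean rating from the summary above, rounded
hypothesis_value <- 7     # value stated in the null hypothesis
standard_deviation <- sd(ratings$rating)
# Area above hypothesis_value under a normal curve centered at sample_mean
pval <- pnorm(hypothesis_value, sample_mean, standard_deviation, lower.tail = FALSE)
pval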
## [1] 0.6945346
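The p-value is 0.69. Strictly speaking, this is the share of a normal distribution with mean 7.8 and our standard deviation that lies above 7, so under a normal approximation about 69% of individual ratings fall above the hypothesized value. Because 0.69 is greater than our significance level of alpha = 0.05, we do not reject the null hypothesis that the average anime rating is a 7 or greater. Note that a test on the mean itself would use the standard error (the standard deviation divided by the square root of the sample size) rather than the standard deviation; since the sample mean of 7.8 already lies above 7, the conclusion is the same.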
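For comparison, here is a sketch of a one-sided test on the mean that uses the standard error. It is an alternative to the calculation above, not the method used there, and it runs on made-up ratings (toy_ratings, mu0, and the other names are illustrative).

```r
# Made-up ratings: 100 values whose mean is exactly 8 (not the Kaggle data).
toy_ratings <- rep(c(6, 7, 8, 9, 10), times = c(10, 20, 40, 20, 10))

mu0 <- 7                          # boundary of the null hypothesis "mean >= 7"
n <- length(toy_ratings)
x_bar <- mean(toy_ratings)
se <- sd(toy_ratings) / sqrt(n)   # standard error of the mean

# Only a small sample mean counts as evidence against "mean >= 7",
# so the p-value is the lower tail of the standard normal at z.
z <- (x_bar - mu0) / se
p_value <- pnorm(z)
print(c(x_bar = x_bar, z = z, p_value = p_value))

# Built-in counterpart based on the t distribution.
t_result <- t.test(toy_ratings, mu = mu0, alternative = "less")
print(t_result$p.value)

# Compare against the significance level alpha = 0.05.
reject_null <- p_value < 0.05
print(reject_null)
```

Because the toy mean sits above 7, the p-value is close to 1 and the null hypothesis is not rejected, in line with the conclusion above.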